Neural Diversity Regularizes Hallucinations in Language Models

Chakrabarti, Kushal, Balachundhar, Nirmal

arXiv.org Artificial Intelligence

Language models continue to hallucinate despite increases in parameters, compute, and data. We propose neural diversity -- decorrelated parallel representations -- as a principled mechanism that reduces hallucination rates at fixed parameter and data budgets. While existing mitigation strategies largely target accuracy, we provide the first formal tail bounds for hallucination probability in ensembled language models, reframing it as a second-moment reliability problem and explaining 94.3% of the empirical reliability variation seen across parallel configurations. We introduce ND-LoRA (Neural Diversity Low-Rank Adaptation), combining parallel LoRA adapters with Barlow Twins regularization, and reduce hallucinations by up to 25.6% (14.6% on average) while preserving general accuracy. Ablations show that LoRA adapters and regularization act synergistically; causal interventions identify neural diversity as the mediating factor; and correlational studies indicate the scale of the effect: a 0.1% increase in neural correlation is associated with a 3.8% increase in hallucination rate. Finally, task-dependent optimality emerges: different tasks require different optimal amounts of neural diversity. Together, our results highlight neural diversity as a third axis of scaling -- orthogonal to parameters and data -- that improves the reliability of language models at fixed budgets.
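As a rough illustration of the "neural correlation" quantity that the abstract associates with hallucination rate, one could average the absolute pairwise correlation between parallel representations. This helper is a hypothetical sketch, not ND-LoRA's actual metric:

```python
import numpy as np

def mean_neural_correlation(reps):
    """Hypothetical diversity metric: mean absolute Pearson correlation
    between flattened pairs of parallel representations.
    Lower values correspond to more neural diversity."""
    vals = []
    for i in range(len(reps)):
        for j in range(i + 1, len(reps)):
            a, b = reps[i].ravel(), reps[j].ravel()
            vals.append(abs(np.corrcoef(a, b)[0, 1]))
    return float(np.mean(vals))

rng = np.random.default_rng(0)
# Three toy "parallel adapter" outputs for the same input batch.
parallel = [rng.standard_normal((4, 8)) for _ in range(3)]
score = mean_neural_correlation(parallel)
```

Identical copies score near 1.0, while independent random representations score near 0, matching the intuition that decorrelation is what the Barlow Twins regularizer encourages.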


Pre-train to Gain: Robust Learning Without Clean Labels

Szczecina, David, Pellegrino, Nicholas, Fieguth, Paul

arXiv.org Artificial Intelligence

Training deep networks with noisy labels leads to poor generalization and degraded accuracy due to overfitting to label noise. Existing approaches for learning with noisy labels often rely on the availability of a clean subset of data. By pre-training a feature-extractor backbone without labels using self-supervised learning (SSL), followed by standard supervised training on the noisy dataset, we can train a more noise-robust model without requiring a subset with clean labels. We evaluate the use of SimCLR and Barlow Twins as SSL methods on CIFAR-10 and CIFAR-100 under synthetic and real-world noise. Across all noise rates, self-supervised pre-training consistently improves classification accuracy and enhances downstream label-error detection (F1 and Balanced Accuracy). The performance gap widens as the noise rate increases, demonstrating improved robustness. Notably, our approach achieves results comparable to ImageNet pre-trained models at low noise levels, while substantially outperforming them under high noise conditions.
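The two-stage recipe can be sketched in terms of data flow. This is a toy stand-in (the function names and the linear-head fit are illustrative, not the authors' code): stage 1 uses no labels at all; stage 2 reuses the pretrained backbone and trains only on the noisy labels.

```python
import numpy as np

def ssl_pretrain(unlabeled_x, dim=16, seed=0):
    """Stand-in for SimCLR/Barlow Twins pretraining: returns a fixed
    projection here, purely to show that this stage sees no labels."""
    rng = np.random.default_rng(seed)
    return rng.standard_normal((unlabeled_x.shape[1], dim))

def supervised_finetune(backbone_W, x, noisy_y, n_classes):
    """Fit a linear head on frozen backbone features via least squares,
    a stand-in for supervised training on the noisy labels."""
    feats = np.maximum(x @ backbone_W, 0)          # ReLU backbone features
    onehot = np.eye(n_classes)[noisy_y]
    head, *_ = np.linalg.lstsq(feats, onehot, rcond=None)
    return head

rng = np.random.default_rng(1)
x_unlab = rng.standard_normal((100, 8))            # unlabeled pool
x_train = rng.standard_normal((50, 8))             # noisy-labeled set
noisy_y = rng.integers(0, 3, size=50)

W = ssl_pretrain(x_unlab)                          # stage 1: label-free
head = supervised_finetune(W, x_train, noisy_y, n_classes=3)  # stage 2
```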


A Appendix

Neural Information Processing Systems

This appendix collects the proofs for MetaMask. Section A.2.1 proves the equality part of Theorem 5.2 and the bounds on the supervised cross-entropy loss, where M denotes the dimensional mask in MetaMask's training paradigm; Equation 20 is supported by an evidence example in Figure 5, which follows Theorem 5.1 in that the self-paced dimensional mask jointly enhances the gradient. With the proofs of Sections A.2.1 and A.2.2 in place, Theorem 5.2 is substituted into Theorem 5.1 to compare lower bounds: the lower bound obtained with the masked representation, i.e., MetaMask, is larger, so the approach better bounds the downstream classification risk. The dimensional confounder is defined, from the dimensional perspective, as a negative factor that may lead to model degradation; MetaMask is trained with a fixed learning rate instead of the cosine annealing strategy.


On the Optimal Representation Efficiency of Barlow Twins: An Information-Geometric Interpretation

Zhang, Di

arXiv.org Machine Learning

Self-supervised learning (SSL) has emerged as a dominant paradigm for learning representations from unlabeled data [5]. Among various SSL approaches, methods based on redundancy reduction, such as Barlow Twins [7], have demonstrated exceptional performance. These methods operate on the principle of making the cross-correlation matrix between two distorted views of the data close to the identity matrix. While empirically successful, a deep theoretical explanation of why this objective leads to high-quality representations is still developing. A key desirable property of a good representation space is efficiency--the degree to which it utilizes its available dimensions to capture semantically meaningful, non-redundant information. An inefficient representation might suffer from dimensional collapse [4], where many dimensions are redundant or encode correlated information, limiting the representation's expressivity and suitability for downstream tasks. In this paper, we address this gap by proposing a novel information-geometric framework [1] for quantifying representation efficiency. Our core contributions are threefold: 1. We formally define the statistical manifold of representations and introduce a measure of representation efficiency η based on the spectrum of the average Fisher Information Matrix (FIM).
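The redundancy-reduction principle described above can be made concrete. Below is a minimal NumPy sketch of the Barlow Twins objective from Zbontar et al. [7] (the small off-diagonal weight follows that paper's λ); it is an illustration, not this paper's code:

```python
import numpy as np

def barlow_twins_loss(z1, z2, lam=5e-3):
    """Barlow Twins objective: push the cross-correlation matrix between
    two batch-standardized views toward the identity matrix."""
    n = z1.shape[0]
    z1 = (z1 - z1.mean(axis=0)) / (z1.std(axis=0) + 1e-12)
    z2 = (z2 - z2.mean(axis=0)) / (z2.std(axis=0) + 1e-12)
    c = z1.T @ z2 / n                               # d x d cross-correlation
    on_diag = np.sum((np.diag(c) - 1.0) ** 2)       # invariance term
    off_diag = np.sum(c ** 2) - np.sum(np.diag(c) ** 2)  # redundancy term
    return on_diag + lam * off_diag

rng = np.random.default_rng(0)
z = rng.standard_normal((256, 32))
aligned = barlow_twins_loss(z, z + 0.05 * rng.standard_normal(z.shape))
independent = barlow_twins_loss(z, rng.standard_normal(z.shape))
```

Two near-identical views yield a small loss, while independent views are heavily penalized on the diagonal; a fully efficient representation in the paper's sense would use all d dimensions with no off-diagonal correlation.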


Contrastive Self-Supervised Learning at the Edge: An Energy Perspective

Famá, Fernanda, Pereira, Roberto, Kalalas, Charalampos, Dini, Paolo, Qendro, Lorena, Kawsar, Fahim, Malekzadeh, Mohammad

arXiv.org Artificial Intelligence

While contrastive learning (CL) shows considerable promise in self-supervised representation learning, its deployment on resource-constrained devices remains largely underexplored. The substantial computational demands required for training conventional CL frameworks pose a set of challenges, particularly in terms of energy consumption, data availability, and memory usage. We conduct an evaluation of four widely used CL frameworks: SimCLR, MoCo, SimSiam, and Barlow Twins. We focus on the practical feasibility of these CL frameworks for edge and fog deployment, and introduce a systematic benchmarking strategy that includes energy profiling and reduced training data conditions. Our findings reveal that SimCLR, contrary to its perceived computational cost, demonstrates the lowest energy consumption across various data regimes. Finally, we also extend our analysis by evaluating lightweight neural architectures when paired with CL frameworks. Our study aims to provide insights into the resource implications of deploying CL in edge/fog environments with limited processing capabilities and opens several research directions for its future optimization. Over the years, a variety of contrastive learning (CL) approaches have been developed, including popular frameworks such as SimCLR [1], MoCo [2], BYOL [3], SimSiam [4], and Barlow Twins [5], each offering specific advantages and trade-offs. These frameworks aim to learn representations by distinguishing between similar (positive) and dissimilar (negative) samples in a latent space. While some methods rely on large negative sample sets to achieve high-quality representations, others bypass the need for negative pairs through momentum encoders or predictor networks.


DinoTwins: Combining DINO and Barlow Twins for Robust, Label-Efficient Vision Transformers

Podsiadly, Michael, Lay, Brendon K

arXiv.org Artificial Intelligence

Training AI models to understand images without costly labeled data remains a challenge. We combine two techniques--DINO (teacher-student learning) and Barlow Twins (redundancy reduction)--to create a model that learns better with fewer labels and less compute. While both DINO and Barlow Twins have independently demonstrated strong performance in self-supervised learning, each comes with limitations--DINO may be sensitive to certain augmentations, and Barlow Twins often requires batch sizes too large to fit on consumer hardware. By combining the redundancy-reduction objective of Barlow Twins with the self-distillation strategy of DINO, we aim to leverage their complementary strengths. We train a hybrid model on the MS COCO dataset using only 10% of labeled data for linear probing, and evaluate its performance against standalone DINO and Barlow Twins implementations. Preliminary results show that the combined approach achieves comparable loss and classification accuracy to DINO while maintaining strong feature representations. Attention visualizations further suggest improved semantic segmentation capability in the hybrid model. This combined method offers a scalable, label-efficient alternative for training ViTs in resource-constrained environments.
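The simplest way to combine the two objectives is a weighted sum of the per-batch losses. The abstract does not specify the weighting, so the convex mix and the value of alpha below are assumptions, sketched only to show the shape of the hybrid objective:

```python
def dinotwins_loss(dino_loss, bt_loss, alpha=0.5):
    """Hypothetical hybrid objective: a convex combination of the DINO
    self-distillation loss and the Barlow Twins redundancy-reduction loss.
    alpha is an assumed hyperparameter, not taken from the paper."""
    return alpha * dino_loss + (1.0 - alpha) * bt_loss

total = dinotwins_loss(dino_loss=1.2, bt_loss=0.8)
```

In practice one would tune alpha (or an unnormalized weight on the Barlow Twins term) so that neither objective dominates early training.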



Enhancing User Sequence Modeling through Barlow Twins-based Self-Supervised Learning

Liu, Yuhan, Ning, Lin, Wu, Neo, Singhal, Karan, Mansfield, Philip Andrew, Berlowitz, Devora, Prakash, Sushant, Green, Bradley

arXiv.org Artificial Intelligence

User sequence modeling is crucial for modern large-scale recommendation systems, as it enables the extraction of informative representations of users and items from their historical interactions. These user representations are widely used for a variety of downstream tasks to enhance users' online experience. A key challenge for learning these representations is the lack of labeled training data. While self-supervised learning (SSL) methods have emerged as a promising solution for learning representations from unlabeled data, many existing approaches rely on extensive negative sampling, which can be computationally expensive and may not always be feasible in real-world scenarios. In this work, we propose an adaptation of Barlow Twins, a state-of-the-art SSL method, to user sequence modeling by incorporating suitable augmentation methods. Our approach aims to mitigate the need for large negative sample batches, enabling effective representation learning with smaller batch sizes and limited labeled data. We evaluate our method on the MovieLens-1M, MovieLens-20M, and Yelp datasets, demonstrating that our method consistently outperforms the widely-used dual encoder model across three downstream tasks, achieving an 8%-20% improvement in accuracy. Our findings underscore the effectiveness of our approach in extracting valuable sequence-level information for user modeling, particularly in scenarios where labeled data is scarce and negative examples are limited.
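The abstract does not name its augmentations; a common choice for interaction sequences is random item masking. The sketch below is one hypothetical option for generating the two views that a Barlow Twins objective needs:

```python
import numpy as np

def mask_items(seq, rate=0.2, mask_id=0, rng=None):
    """Hypothetical sequence augmentation: randomly replace a fraction of
    items in a user's interaction history with a mask token."""
    rng = rng or np.random.default_rng()
    seq = np.asarray(seq).copy()
    seq[rng.random(seq.shape[0]) < rate] = mask_id
    return seq

rng = np.random.default_rng(0)
history = np.arange(1, 21)   # toy 20-item interaction history (item ids)
# Two stochastic views of the same user, to be encoded and decorrelated.
view1, view2 = mask_items(history, rng=rng), mask_items(history, rng=rng)
```

Because the objective only needs the cross-correlation between the two encoded views, no negative users are sampled, which is the stated advantage over dual-encoder-style training.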


Supervised Pretraining for Material Property Prediction

Rahman, Chowdhury Mohammad Abid, Romero, Aldo H., Gyawali, Prashnna K.

arXiv.org Artificial Intelligence

Accurate prediction of material properties facilitates the discovery of novel materials with tailored functionalities. Deep learning models have recently shown superior accuracy and flexibility in capturing structure-property relationships. However, these models often rely on supervised learning, which requires large, well-annotated datasets -- an expensive and time-consuming process. Self-supervised learning (SSL) offers a promising alternative by pretraining on large, unlabeled datasets to develop foundation models that can be fine-tuned for material property prediction. In this work, we propose supervised pretraining, where available class information serves as surrogate labels to guide learning, even when downstream tasks involve unrelated material properties. We evaluate this strategy on two state-of-the-art SSL models and introduce a novel framework for supervised pretraining. To further enhance representation learning, we propose a graph-based augmentation technique that injects noise to improve robustness without structurally deforming material graphs. The resulting foundation models are fine-tuned for six challenging material property predictions, achieving significant performance gains over baselines, ranging from 2% to 6.67% improvement in mean absolute error (MAE) and establishing a new benchmark in material property prediction. This study represents the first exploration of supervised pretraining with surrogate labels in material property prediction, advancing methodology and application in the field.
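One way to inject noise without structurally deforming a material graph is to perturb node features while leaving the edge list untouched. The helper below is a hypothetical sketch of that idea, not the paper's implementation:

```python
import numpy as np

def feature_noise_augment(node_feats, edges, sigma=0.05, rng=None):
    """Hypothetical graph augmentation: Gaussian noise perturbs node
    features; the edge list (the material's bonded structure) is
    returned unchanged, so the graph topology is preserved."""
    rng = rng or np.random.default_rng()
    noisy = node_feats + sigma * rng.standard_normal(node_feats.shape)
    return noisy, edges

rng = np.random.default_rng(0)
feats = rng.standard_normal((5, 4))        # 5 atoms, 4 features each
edges = [(0, 1), (1, 2), (2, 3), (3, 4)]   # bonds: structure to preserve
noisy_feats, same_edges = feature_noise_augment(feats, edges, rng=rng)
```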


Projection Head is Secretly an Information Bottleneck

Ouyang, Zhuo, Hu, Kaiwen, Zhang, Qi, Wang, Yifei, Wang, Yisen

arXiv.org Artificial Intelligence

Recently, contrastive learning has risen to prominence as a paradigm for extracting meaningful data representations. Among various special designs, adding a projection head on top of the encoder during training and removing it for downstream tasks has proven to significantly enhance the performance of contrastive learning. However, despite its empirical success, the underlying mechanism of the projection head remains under-explored. In this paper, we develop an in-depth theoretical understanding of the projection head from the information-theoretic perspective. By establishing theoretical guarantees on the downstream performance of the features before the projector, we reveal that an effective projector should act as an information bottleneck, filtering out the information irrelevant to the contrastive objective. Based on these theoretical insights, we introduce modifications to projectors with training and structural regularizations. We believe our theoretical understanding of the role of the projection head will inspire more principled and advanced designs in this field. In recent years, contrastive learning has emerged as a promising representation learning paradigm and exhibited impressive performance without supervised labels (Chen et al., 2020; He et al., 2020; Zbontar et al., 2021). The core idea of contrastive learning is simple: pull the augmented views of the same sample (i.e., positive samples) together while pushing independent samples (i.e., negative samples) apart. To improve the downstream performance of contrastive learning, researchers have proposed various special training objectives and architecture designs (Grill et al., 2020; Wang et al., 2021; Guo et al., 2023; Wang et al., 2023; 2024; Du et al., 2024).
Among them, one of the most widely-used techniques is the projection head (i.e., projector) (Chen et al., 2020), which is a shallow layer following the backbone during pretraining and is discarded in downstream tasks like image classification and object detection. It has been shown that the features before the projector (denoted as encoder features) exhibit much better downstream performance than the features after the projector (denoted as projector features) across various applications (Jing et al., 2021; Gupta et al., 2022). Inspired by the success of the projection head in contrastive learning, researchers also extend this architecture to other representation learning paradigms and achieve significant improvements (Sariyildiz et al., 2022; Zhou et al., 2021). However, although the projection head has been widely adopted, the understanding of the underlying mechanism behind it is still quite limited. In this paper, we aim to establish a theoretical analysis of the relationship between the projection head and the downstream performance of contrastive learning.
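The train-with-projector, discard-for-downstream pattern the paper analyzes can be sketched with toy NumPy stand-ins (illustrative weights and shapes, not the paper's code):

```python
import numpy as np

rng = np.random.default_rng(0)
W_enc = rng.standard_normal((16, 32))    # toy backbone weights
W_proj = rng.standard_normal((32, 8))    # toy projection-head weights

def encoder(x):
    """Backbone: produces the 'encoder features' kept for downstream tasks."""
    return np.maximum(x @ W_enc, 0.0)

def projector(h):
    """Shallow projection head: its output ('projector features') is fed to
    the contrastive loss during pretraining, then discarded."""
    return h @ W_proj

x = rng.standard_normal((4, 16))
h = encoder(x)       # used downstream (e.g., classification, detection)
z = projector(h)     # used only inside the pretraining objective
```

The paper's claim is that the projector acts as an information bottleneck: z is optimized against the contrastive objective, which lets h retain task-relevant information that the loss would otherwise strip out.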